[Spec Decode] Add hidden states extraction system by fynnsu · Pull Request #33736 · vllm-project/vllm

fynnsu · 2026-02-03T22:18:06Z

Purpose

In-tree implementation of the hidden states extraction system described in #33118.

FIX #33318

Components

Configs

vllm/config/speculative.py: Add handling for extract_hidden_states spec method
vllm/transformers_utils/configs/extract_hidden_states.py: ExtractHiddenStatesConfig definition

ExampleHiddenStatesConnector

vllm/distributed/kv_transfer/kv_connector/v1/example_hidden_states_connector.py: Connector definition
vllm/distributed/kv_transfer/kv_connector/factory.py: Registration

This is a custom connector for extracting hidden states. It is designed to work only when extract_hidden_states is used as the speculative method, and it extracts data only from CacheOnlyAttentionLayers (which is where the dummy model stores the hidden states).

This is not the most performant connector, but it provides a good debug implementation and a starting point for future hidden states connectors.

KV Connector for ExtractHiddenStates dummy model

vllm/v1/outputs.py: Add support for merging KVConnectorOutputs
vllm/distributed/kv_events.py: Add support for merging KVConnectorEvents
vllm/v1/worker/gpu_model_runner.py: Add KVConnectorOutput merge when using extract_hidden_states

Currently, KV connectors are not supported together with draft models. The KVConnector context is initialized only for the verifier model's forward pass and no longer exists during the draft model call, so the draft model cannot use the KVConnector. This problem goes beyond this PR, because it also prevents prefill/decode (P/D) disaggregation with speculative decoding.

For the scope of this PR, I have developed a temporary solution for this extraction pathway: set up a second KVConnector context when calling the draft model. This is safe because the custom ExampleHiddenStatesConnector only saves state from CacheOnlyAttentionLayers, which exist only in the drafter. I then merge the two KVConnectorOutputs so that the scheduler gets complete information from both contexts.
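
The merge step can be pictured with a minimal sketch. The `ConnectorOutputSketch` class and its two fields below are hypothetical stand-ins chosen for illustration; they are not the actual KVConnectorOutput definition in vllm/v1/outputs.py, whose fields and merge rules may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectorOutputSketch:
    """Hypothetical stand-in for one connector context's result (not vLLM's class)."""

    finished_sending: set[str] = field(default_factory=set)
    finished_recving: set[str] = field(default_factory=set)


def merge_outputs(target, drafter):
    """Union the request IDs reported by the target and drafter contexts."""
    return ConnectorOutputSketch(
        finished_sending=target.finished_sending | drafter.finished_sending,
        finished_recving=target.finished_recving | drafter.finished_recving,
    )


# One output per context: the verifier forward and the dummy drafter forward.
target_out = ConnectorOutputSketch(finished_recving={"req-0"})
drafter_out = ConnectorOutputSketch(finished_sending={"req-0", "req-1"})

# The scheduler would then see a single, combined view.
merged = merge_outputs(target_out, drafter_out)
```

A set union is assumed here because neither context should hide a request that the other one reported; the real merge may need to combine additional fields.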

ExtractHiddenStatesModel

vllm/model_executor/models/extract_hidden_states.py: Model definition + custom attention backend/layer definitions
vllm/model_executor/models/registry.py: Registration

Dummy "drafter" model that just allocates kv cache space in fake attention layers and then caches its inputs there, while triggering kv connector load/save.

Note: we set num_heads to the number of hidden layers we're saving and head_size to the model's hidden_size so that the data can be inserted without additional reshaping/splitting across dummy layers.

This model also checks eagle_aux_hidden_state_layer_ids to determine how many layers will be cached.
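
As a rough illustration of the layout described above, the sketch below packs per-layer hidden states into a per-token `[num_heads][head_size]` shape using plain Python lists. The layer IDs, sizes, and the `pack_hidden_states` helper are assumptions made for the example; the actual model works on tensors inside the KV cache.

```python
def pack_hidden_states(per_layer_states):
    """Pack per-layer states into a [num_tokens][num_heads][head_size] layout.

    per_layer_states holds one entry per extracted layer; each entry is a
    list of per-token vectors of length hidden_size.
    """
    num_tokens = len(per_layer_states[0])
    return [[layer[t] for layer in per_layer_states] for t in range(num_tokens)]


# Hypothetical values standing in for eagle_aux_hidden_state_layer_ids
# and the target model's hidden_size.
layer_ids = [1, 3, 5]
hidden_size = 4
num_tokens = 2

# Fake hidden states: every element of layer i's vectors equals float(i).
per_layer = [
    [[float(layer_id)] * hidden_size for _ in range(num_tokens)]
    for layer_id in layer_ids
]

packed = pack_hidden_states(per_layer)
num_heads = len(packed[0])     # one "head" per extracted layer
head_size = len(packed[0][0])  # each "head" spans the full hidden_size
```

With this choice each extracted layer occupies exactly one head slot per token, so no vector has to be split or reshaped across several dummy layers.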

ExtractHiddenStatesProposer

vllm/v1/spec_decode/extract_hidden_states.py: Proposer definition
vllm/v1/worker/gpu_model_runner.py: Proposer integration (matches if/else integration of existing proposers)

Sets up the contexts (KV connector and forward) for the call to the fake drafter model, and handles the "dummy" drafting process (i.e., returning the original tokens from the target plus dummy drafts).
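
The "dummy" drafting idea can be sketched as follows. The function name, its parameters, and the placeholder token value are hypothetical and only illustrate the shape of the result; they are not taken from ExtractHiddenStatesProposer.

```python
def dummy_propose(sampled_token_ids, num_draft_tokens=1, placeholder_id=0):
    """Return the target's tokens unchanged plus placeholder drafts per request."""
    # Tokens sampled by the target model are passed through untouched.
    target_tokens = [list(tokens) for tokens in sampled_token_ids]
    # No real drafting happens; each request gets fixed placeholder drafts.
    dummy_drafts = [[placeholder_id] * num_draft_tokens for _ in sampled_token_ids]
    return target_tokens, dummy_drafts


# Two requests, each with one token sampled by the target model.
tokens, drafts = dummy_propose([[11], [42]])
```

The point of the placeholder drafts is only to satisfy the speculative decoding interface while the drafter forward pass does the caching work.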

Test Plan

examples/offline_inference/extract_hidden_states.py: Example script showing correct usage and how to extract the save path from the model output and load the hidden states.
tests/v1/kv_connector/extract_hidden_states_integration/: Integration tests for hidden states extraction system
tests/v1/spec_decode/test_extract_hidden_states.py: Unit tests for ExtractHiddenStatesProposer

mergify · 2026-02-03T22:18:44Z

Documentation preview: https://vllm--33736.org.readthedocs.build/en/33736/

gemini-code-assist

Code Review

This pull request introduces a new speculative decoding method, extract_hidden_states, designed to extract and save hidden states from a model. The implementation is comprehensive, adding a new proposer, a dummy model with a cache-only attention mechanism, a corresponding KV connector, and an example script. The changes are well-integrated with the existing speculative decoding framework. My main feedback is to remove a debugging print statement from the configuration logic to ensure clean logs for users.

vllm/config/speculative.py

vllm/v1/spec_decode/extract_hidden_states.py

examples/offline_inference/extract_hidden_states.py

vllm/model_executor/models/extract_hidden_states.py

vllm/v1/attention/backends/flash_attn.py